Refactor: Decouple Core Logic into a Reusable Library #15
This PR refactors the project from a monolithic script into a well-defined, reusable library. The core training, data handling, and tokenizer management logic has been extracted from `train.py` into decoupled, object-oriented components. The goal is to create a clean API that can be easily used and extended in other projects.

The `train.py` script is now a simple command-line client that demonstrates how to use the new library components.

Key Changes ✨
- `Trainer` Class: A new `scratchgpt/training/trainer.py` module introduces the `Trainer` class, which now encapsulates all logic for training loops, validation, pre-tokenization, and model checkpointing.
- `DataSource` Protocol: A new, flexible `scratchgpt/data/datasource.py` module defines a protocol for data loading. We've included concrete `FileDataSource` and `FolderDataSource` implementations, replacing the old `TextProvider` classes.
- The `get_tokenizer` function in `scratchgpt/model_io.py` has been updated to use a factory pattern. This makes creating a default tokenizer more robust and explicit.
- `tests/` directory (`test_tokenizer_io.py`, `tests/tokenizers/..`), improving maintainability.
- The `train.py` script now uses a `--tokenizer` argument to dynamically load any tokenizer from the Hugging Face Hub, making it significantly more versatile.

Highlights for Review 🔍
When reviewing, please pay special attention to:
- `Trainer` API: This is the new heart of the library. Is its interface clear? Does it correctly encapsulate the training logic?
- `DataSource` Protocol: This is our core data abstraction. Is it flexible enough for future use cases?
- `get_tokenizer` Factory Pattern: Review the new signature in `model_io.py`. This is a key design pattern for how we manage object creation.
- `train.py`: As the first client of our new library, does it demonstrate a clean and intuitive workflow?
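
For reviewers weighing the `DataSource` question, here is a minimal sketch of what a protocol-based data abstraction can look like. It is not the code in `scratchgpt/data/datasource.py`: the `__iter__` method, the constructor arguments, the `*.txt` pattern, and the `count_characters` helper are all assumptions made for illustration. Only the class names come from this PR.

```python
from collections.abc import Iterator
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Hypothetical protocol: anything that can be iterated to yield text samples."""

    def __iter__(self) -> Iterator[str]: ...


class FileDataSource:
    """Illustrative implementation: yields one file's contents as a single sample."""

    def __init__(self, path: Path) -> None:
        self._path = Path(path)

    def __iter__(self) -> Iterator[str]:
        yield self._path.read_text(encoding="utf-8")


class FolderDataSource:
    """Illustrative implementation: yields each matching file under a folder."""

    def __init__(self, folder: Path, pattern: str = "*.txt") -> None:
        self._folder = Path(folder)
        self._pattern = pattern  # assumed default, not taken from the PR

    def __iter__(self) -> Iterator[str]:
        # Sort so the iteration order is deterministic across platforms.
        for file in sorted(self._folder.rglob(self._pattern)):
            if file.is_file():
                yield file.read_text(encoding="utf-8")


def count_characters(source: DataSource) -> int:
    """Example consumer that depends only on the protocol, not on a concrete class."""
    return sum(len(text) for text in source)
```

Because the protocol is structural, neither concrete class inherits from `DataSource`; any object with a compatible `__iter__` satisfies it. That is the property to check when asking whether the abstraction is flexible enough for future sources such as streaming or in-memory data.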